05/28/2020
Understanding NGS file formats
Understanding NGS quality assessment
FASTA format reports a sequence
Can contain protein sequences or nucleic acid sequences
FASTA format reports a sequence
Can contain protein sequences or nucleic acid sequences
Common applications include
FASTA format reports a sequence
Can contain protein sequences or nucleic acid sequences
Common applications include
Nearly everything works with this format. Some common examples are:
| Phred quality score (Q) | Probability of incorrect call (P) | Base call accuracy |
|---|---|---|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10000 | 99.99% |
| 50 | 1 in 100000 | 99.999% |
Sequence Alignment Map (SAM)
Sequence Alignment Map (SAM)
Sequence Alignment Map (SAM)
Sequence Alignment Map (SAM)
Sequence Alignment Map (SAM)
BAM - compressed searchable binary SAM
CRAM - even smaller compressed searchable binary SAM
Sequence Alignment Map (SAM)
Fully Described in a specification
Complex header - many optional fields Some include:
Where is it used?
Sequence Alignment Map (SAM)
11 manditory fields
Flags can tell you about each read, and allow for summaries on the file, and filtering.
CIGAR can encode the alignment
MAPQ can encode the Mapping Quality
\( -10 log_{10} Pr \{mapping\ position\ is\ wrong\} \)
255 indicates no mapping quality is availible
CIGAR can encode the alignment
CIGAR can encode the alignment
BED (Browser Extensible Data)
Simple format to describe intervals on the genome
Simple format to describe intervals on the genome
Basic form is 3 columns
start is 0 based
end is 1 based
the first 100 bases on chromosome 1 would be represented with
chr1 0 100
and the next 100 bases
chr1 100 200
Optional Fields:
Name, Score, Strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
Optional Fields
Name, Score, Strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
Optional Fields
Name, Score, Strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
Optional Fields
Name, Score, Strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
What software use bed files?
GTF (Gene Transfer Format)
Mostly used to describe Genes.
GTF (Gene Transfer Format)
Mostly used to describe Genes.
First 8 fields are required:
¯\_(ツ)_/¯GTF (Gene Transfer Format)
Mostly used to describe Genes.
First 8 fields are required:
9th column
Required:
gene_id “ENSG00000227232.5”; transcript_id “ENST00000488147.1”;
Optional:
gene_type “unprocessed_pseudogene”; gene_name “WASH7P”;
transcript_type “unprocessed_pseudogene”; transcript_name “WASH7P-001”;
exon_number 11; exon_id “ENSE00001843071.1”;
level 2; transcript_support_level “NA”;
ont “PGO:0000005”; tag “basic”;
havana_gene “OTTHUMG00000000958.1”; havana_transcript “OTTHUMT00000002839.1”;
VCF (Variant Calling Format)
Describes SNVs and INDELs
VCF (Variant Calling Format)
Describes SNVs and INDELs
VCF (Variant Calling Format)
Another complex format, but has an official specification
8 required Fields
VCF (Variant Calling Format)
8 required Fields:
VCF (Variant Calling Format)
8 required Fields: